| time (min) | topic |
|---|---|
| 15 | Initial data analysis |
| 30 | Exploring data |
| 15 | Constructing null samples |
| 20 | Wrap-up: questions, discussion, other topics |
> The first thing to do with data is to look at them … usually means tabulating and plotting the data in many different ways to see what’s going on. With the wide availability of computer packages and graphics nowadays there is no excuse for ducking the labour of this preliminary phase, and it may save some red faces later.
>
> Crowder, M. J. & Hand, D. J. (1990), *Analysis of Repeated Measures*
Initial data analysis (IDA) includes:
World Development Indicators, 2004-2022 data for selected series.
```
Rows: 90,972
Columns: 4
$ country_code <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AF…
$ series_code  <chr> "EG.CFT.ACCS.ZS", "EG.CFT.ACCS.ZS", "EG.CFT.ACCS.ZS"…
$ year         <dbl> 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012…
$ value        <dbl> 10.5, 11.9, 13.5, 15.1, 16.6, 18.3, 19.9, 21.3, 22.9…
```
Tidy format variables are:
In long form, the data can be pivoted in different ways to explore missing values by series, country, and year.
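A minimal sketch of that idea, in plain Python rather than the tooling used to produce the output above. The records follow the four-column long layout shown earlier; the `XYZ` country, the `SERIES.B` series, the helper `missing_proportion`, and all values other than the two Afghanistan readings are invented for illustration.

```python
from collections import Counter

# Hypothetical long-form records: (country_code, series_code, year, value).
# None marks a missing value. Only the first two values come from the
# output above; everything else is made up for this sketch.
records = [
    ("AFG", "EG.CFT.ACCS.ZS", 2004, 10.5),
    ("AFG", "EG.CFT.ACCS.ZS", 2005, 11.9),
    ("AFG", "SERIES.B", 2004, None),
    ("AFG", "SERIES.B", 2005, None),
    ("XYZ", "EG.CFT.ACCS.ZS", 2004, None),
    ("XYZ", "EG.CFT.ACCS.ZS", 2005, 55.0),
    ("XYZ", "SERIES.B", 2004, 3.2),
    ("XYZ", "SERIES.B", 2005, None),
]

# Position of each key column within a record.
FIELDS = {"country_code": 0, "series_code": 1, "year": 2}


def missing_proportion(rows, by):
    """Proportion of missing values within each level of one key column."""
    idx = FIELDS[by]
    total, missing = Counter(), Counter()
    for row in rows:
        total[row[idx]] += 1
        if row[3] is None:
            missing[row[idx]] += 1
    return {level: missing[level] / total[level] for level in total}


# The same long table, summarised three different ways.
by_series = missing_proportion(records, "series_code")
by_country = missing_proportion(records, "country_code")
by_year = missing_proportion(records, "year")

print(by_series)
print(by_country)
print(by_year)
```

Because the table is long, switching the grouping key is a one-argument change. A wide table would need a different loop for each view.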
Strategy for handling missing values:
Read more at R-miss-tastic.
Hmm, what happened?
Illustrations from Julia Lowndes and Allison Horst
Analysis flows more neatly when the data are in tidy format.
OECD PISA data, sample from 2018
```
Rows: 612,004
Columns: 22
$ year        <fct> 2018, 2018, 2018, 2018, 2018…
$ country     <fct> ALB, ALB, ALB, ALB, ALB, ALB…
$ school_id   <fct> 800002, 800002, 800002, 8000…
$ student_id  <fct> 800251, 800402, 801902, 8035…
$ mother_educ <fct> "ISCED 3A", "ISCED 2", "ISCE…
$ father_educ <fct> "ISCED 3A", "ISCED 2", "ISCE…
$ gender      <fct> male, male, female, male, ma…
$ computer    <fct> yes, yes, no, no, yes, yes, …
$ internet    <fct> yes, yes, no, no, yes, yes, …
$ math        <dbl> 490.187, 462.464, 406.949, 4…
$ read        <dbl> 375.984, 434.352, 359.191, 4…
$ science     <dbl> 445.039, 421.731, 392.223, 5…
$ stu_wgt     <dbl> 13.51452, 13.51452, 9.50669,…
$ desk        <fct> yes, yes, yes, yes, yes, yes…
$ room        <fct> yes, yes, yes, no, yes, yes,…
$ dishwasher  <fct> NA, NA, NA, NA, NA, NA, NA, …
$ television  <fct> 3+, 1, 1, 0, 2, 1, NA, 1, 1,…
$ computer_n  <fct> 1, 1, 0, 0, 1, 1, NA, 0, 1, …
$ car         <fct> 2, 2, 0, NA, 0, NA, NA, 0, 2…
$ book        <fct> 0-10, 11-25, 0-10, 0-10, 11-…
$ wealth      <dbl> -0.0996, -0.7221, -3.6051, -…
$ escs        <dbl> 0.6747, -0.7566, -2.5112, -3…
```
The gender gap in math is not universal. 😱
There are now many countries where girls score higher on average than boys.
On the other hand, the reading gap is universal: in every country, girls score higher than boys on average. 🤯
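One way such a gap could be computed is as a difference in weighted means per country, using the student weight (`stu_wgt`) from the data shown above. This is a sketch under assumptions, not necessarily the calculation behind the plots: the country codes `AAA` and `BBB`, the scores, the weights, and the helper `weighted_gap` are all invented, and it assumes every country has at least one girl and one boy.

```python
from collections import defaultdict

# Hypothetical student records: (country, gender, score, stu_wgt).
students = [
    ("AAA", "female", 500.0, 2.0),
    ("AAA", "female", 480.0, 2.0),
    ("AAA", "male", 470.0, 1.0),
    ("AAA", "male", 510.0, 3.0),
    ("BBB", "female", 450.0, 1.0),
    ("BBB", "male", 430.0, 1.0),
]


def weighted_gap(rows):
    """Weighted mean score of girls minus that of boys, per country."""
    sums = defaultdict(float)     # (country, gender) -> sum of weight * score
    weights = defaultdict(float)  # (country, gender) -> sum of weights
    for country, gender, score, wgt in rows:
        sums[(country, gender)] += wgt * score
        weights[(country, gender)] += wgt
    gaps = {}
    for country in sorted({c for c, _ in sums}):
        girls = sums[(country, "female")] / weights[(country, "female")]
        boys = sums[(country, "male")] / weights[(country, "male")]
        gaps[country] = girls - boys
    return gaps


gaps = weighted_gap(students)
for country, gap in gaps.items():
    # A positive gap means girls score higher on average.
    print(country, round(gap, 1))
```

A gap that is positive in some countries and negative in others corresponds to "not universal". A gap with the same sign everywhere corresponds to "universal". A point estimate alone does not show whether a small gap differs from zero, so some measure of uncertainty would be needed before reading much into it.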
Log(wages) of 888 individuals, measured at various times in their employment, from the US National Longitudinal Survey of Youth.
Wages tend to increase as time in the workforce gets longer, on average.
The higher the education level achieved, the higher the overall wage, on average.
Measuring interesting
Compute longnostics for each subject, for example,
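As a rough illustration of per-subject summaries, the sketch below computes a handful of candidate statistics for each individual: number of observations, minimum, maximum, and least-squares slope over time. Which statistics to compute is an assumption made here, as are the helper names `slope` and `longnostics` and the toy data. This is plain Python, not the API of any particular package.

```python
from collections import defaultdict

# Hypothetical long-form records: (subject id, years in workforce, log wage).
obs = [
    (1, 0.0, 1.0), (1, 1.0, 1.5), (1, 2.0, 2.0),
    (2, 0.0, 2.0), (2, 2.0, 1.0),
]


def slope(xs, ys):
    """Least-squares slope of y on x; None if x has no spread."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return None
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def longnostics(rows):
    """One summary record per subject."""
    by_subject = defaultdict(list)
    for sid, x, y in rows:
        by_subject[sid].append((x, y))
    out = {}
    for sid, pts in by_subject.items():
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        out[sid] = {
            "n_obs": len(pts),
            "min": min(ys),
            "max": max(ys),
            "slope": slope(xs, ys),
        }
    return out


summaries = longnostics(obs)
for sid, stats in summaries.items():
    print(sid, stats)
```

With one row of summaries per subject, individuals can be ranked or filtered, for example those with the steepest decline, and their full series pulled out for a closer look.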
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.